AITopics | cost efficiency

Collaborating Authors

cost efficiency

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Efficient Training-Free Online Routing for High-Volume Multi-LLMServing

Neural Information Processing SystemsJun-22-2026, 17:49:37 GMT

Increasing demand for Large Language Models (LLMs) services imposes substantial deployment and computation costs on providers. LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features. However, existing works primarily focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets. In this work, we introduce the first training-free algorithm for online routing scenarios. Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing. We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of 1 o(1)under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of 3.55 in overall performance, 1.85 in cost efficiency, and nearly 4.25 in throughput. Our code is available at https://github.com/fzwark/PORT.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Neural Information Processing SystemsJun-19-2026, 07:25:20 GMT

Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edgeassisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91 through achieving 2.22 server throughput, and reduces inter token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Park, Jinwoo, Cho, Seunggeun, Han, Dongsu

arXiv.org Artificial IntelligenceNov-19-2025

Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge enhances overall cost efficiency by 1.91x through achieving 2.22x server throughput, and reduces inter token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving. The code is available at https://github.com/kaist-ina/specedge

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.17052

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

Li, Xiangchen, Spatharakis, Dimitrios, Ghafouri, Saeid, Fan, Jiakun, Vandierendonck, Hans, John, Deepu, Ji, Bo, Nikolopoulos, Dimitrios

arXiv.org Artificial IntelligenceNov-6-2025

The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose \acronym, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the tokens utilizing a more precise target model. To further increase the efficiency of verification, the edge server batch the diverse verification requests from devices. This approach supports device heterogeneity and reduces server-side memory footprint by sharing the same upstream target model across multiple devices. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits: 2.2 more system throughput, 2.8 more system capacity, and better cost efficiency, all without sacrificing model accuracy.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.09397

Country:

Europe (0.94)
North America > United States > Virginia (0.31)

Genre: Research Report > Promising Solution (0.34)

Industry:

Energy (0.93)
Information Technology > Hardware (0.56)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading

Kim, Kihyun, Kim, Jinwoo, Chung, Hyunsun, Cha, Myung-Hoon, Kim, Hong-Yeon, Kim, Youngjae

arXiv.org Artificial IntelligenceApr-17-2025

LLM inference is essential for applications like text summarization, translation, and data analysis, but the high cost of GPU instances from Cloud Service Providers (CSPs) like AWS is a major burden. This paper proposes InferSave, a cost-efficient VM selection framework for cloud based LLM inference. InferSave optimizes KV cache offloading based on Service Level Objectives (SLOs) and workload charac teristics, estimating GPU memory needs, and recommending cost-effective VM instances. Additionally, the Compute Time Calibration Function (CTCF) improves instance selection accuracy by adjusting for discrepancies between theoretical and actual GPU performance. Experiments on AWS GPU instances show that selecting lower-cost instances without KV cache offloading improves cost efficiency by up to 73.7% for online workloads, while KV cache offloading saves up to 20.19% for offline workloads.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.11816

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services (0.86)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Cloud Computing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving

Mei, Kai, Xu, Wujiang, Lin, Shuhang, Zhang, Yongfeng

arXiv.org Artificial IntelligenceMar-7-2025

As large language models (LLMs) are increasingly deployed as service endpoints in systems, the surge in query volume creates significant scheduling challenges. Existing scheduling frameworks mainly target at latency optimization while neglecting the capability of LLMs to serve different level of queries, which could lead to computational resource waste. This paper addresses this challenge by proposing a capability-cost coordinated scheduling framework, ECCOS, for multi-LLM serving, which explicitly constrains response quality and workload to optimize LLM inference cost. Specifically, it introduces the two-stage scheduling by designing a multi-objective predictor and a constrained optimizer. The predictor estimates both model capabilities and computational costs through training-based and retrieval-based approaches, while the optimizer determines cost-optimal assignments under quality and workload constraints. It also introduces QAServe, a dataset collected for sample-wise response quality and costs by zero-shot prompting different LLMs on knowledge QA and mathematical reasoning. Extensive experiments demonstrate that ECCOS improves success rates by 6.30% while reducing costs by 10.15% compared to existing methods, consuming less than 0.5% of LLM response time. The code is available at: https://github.com/agiresearch/ECCOS.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.20576

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Enhancing Supply Chain Resilience with Metaverse and ChatGPT Technologies

Sarhir, Oumaima

arXiv.org Artificial IntelligenceDec-31-2024

Global supply lines have been severely disrupted by the COVID-19 epidemic and the conflict between Russia and Ukraine, which has sharply increased the price of commodities and generated inflation. These incidents highlight how critical it is to improve supply chain resilience (SCRES) in order to fend off unforeseen setbacks. Controlling both internal and external interruptions, such as transportation problems brought on by natural catastrophes and wars, is the responsibility of SCRES. Enhancing resilience in supply chains requires accurate and timely information transfer. Promising answers to these problems can be found in the Metaverse and ChatGPT, two new digital technologies. The Metaverse may imitate real-world situations and offer dynamic, real-time 3D representations of supply chain data by integrating blockchain, IoT, network connection, and computer power.Large-scale natural language processing model ChatGPT improves communication and data translation accuracy and speed. To manage risk and facilitate decision making in Supply Chain management, firms should increase information transmission, Speed and quality. This study aim to show the importance of ChatGPT and Metaverse technologies to improve SCRES, with an emphasis on the most important criteria for SCRES, and maturity factor that can influence directly the SC development.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.14777

Country:

Europe > Ukraine (0.24)
Europe > Russia (0.24)
Asia > Russia (0.24)
(2 more...)

Genre: Research Report (1.00)

Industry:

Banking & Finance (1.00)
Information Technology > Security & Privacy (0.93)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.34)
Health & Medicine > Therapeutic Area > Immunology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AI-driven innovation in medicaid: enhancing access, cost efficiency, and population health management

Ingole, Balaji Shesharao, Ramineni, Vishnu, Krishnappa, Manjunatha Sughaturu, Jayaram, Vivekananda

arXiv.org Artificial IntelligenceOct-11-2024

Medicaid is a federal-state program that provides healthcare to over 80 million low-income Americans, including pregnant women, children, and individuals with disabilities. Up against a host of problems, including rising healthcare costs, disparity in access, and the management of chronic conditions among at-risk groups, Medicaid is one of the biggest healthcare payers in the U.S. Just as Medicare does, the use of Artificial Intelligence (AI) offers a major opportunity to change the delivery of care and operational efficiency in Medicaid [1] [16]. While there has been extensive conversation about AI in Medicare, the unique population and requirements of Medicaid require customized AI applications [1]. Chronic disease management, improving admin tasks, and a reduction in costs are amongst the ways AI tools can help, especially by focusing on social determinants of health (SDOH) that are important for Medicaid populations. The study will assess the ability of AI-enabled systems to reinforce Medicaid in handling its particular challenges while facilitating fair and quality care for its entire population of beneficiaries [8] [9].

artificial intelligence, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

2410.21284

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York (0.05)
North America > United States > Texas > Collin County > Plano (0.05)
(13 more...)

Genre:

Research Report (0.65)
Overview > Innovation (0.52)

Industry:

Health & Medicine > Health Care Providers & Services > Reimbursement (1.00)
Health & Medicine > Government Relations & Public Policy (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

M\'elange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Griggs, Tyler, Liu, Xiaoxuan, Yu, Jiaxiang, Kim, Doyoung, Chiang, Wei-Lin, Cheung, Alvin, Stoica, Ion

arXiv.org Artificial IntelligenceJun-27-2024

Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce M\'elange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing M\'elange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, M\'elange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.

cost efficiency, request size, slo, (14 more...)

arXiv.org Artificial Intelligence

2404.14527

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.46)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Efficient and Economic Large Language Model Inference with Attention Offloading

Chen, Shaoyuan, Lin, Yutong, Zhang, Mingxing, Wu, Yongwei

arXiv.org Artificial IntelligenceMay-2-2024

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.

accelerator, attention operator, inference, (15 more...)

arXiv.org Artificial Intelligence

2405.01814

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback